Introduction

This markdown serves as a companion to Chapters 13 and 14 of the Statistical Rethinking book by Professor Richard McElreath as presented on YouTube from his Winter 2019 lectures 15-18, plus additional notes from the book. Chapter13 starts with lecture 15. You need the following packages:

require(tidyverse)
require(gridExtra)
require(ggridges)
require(rstan)
require(rethinking)
require(dagitty)
require(MASS)
require(ellipse)

Lecture 15 - Models with Memory

An introduction to multi-level Bayesian models

Musicologist called Clive Wearing. In 1985 he got herpes but in his brain. It caused brain damage, affecting much of his hippocampus and he got anterograde amnesia. He kept old memories but lost the ability to form new memories. Can still play the piano but can’t remember what happened one minute ago. The statistical models we have considered so far, which have all been fixed effects models have also had anterograde amnesia. Any time we have moved to a new cluster in our data, the model forgets what it has seen so far.

Multilevel models are models that remember. By remember they pool information. Here, the properties of clusters of information in our data come from a population. The inferred population defines pooling. If previous clusters improve your guess about a new clusters, then you want to be using this pooling. However, the order that the clusters are ‘visited’ does not matter. Imagine you’re visiting cafes in various European countries. In a lot of ways they are similar. Berlin and Paris have different types of cafe. How long does the coffee take to come? If we’ve never been to a cafe, the first time we go we get an expectation. If our coffee took 5 minutes to get to us in Paris, you have an expectation about how long coffee should take to get to you in Berlin, but not exactly the same. The fixed effect only method, we would have no expectation. Also, when we visit either cafe, we also update information overall regarding how long coffee takes to arrive generally - so we update our expectation for cafes in both Berlin and Paris. We’re going to do this statistically.

Consider this more explicitly. We programme a robot to visit two cafes and estimate the waiting time with a Bayesian model. The robot begins with a vague prior for the waiting time with a mean of 5 mins and SD of 1. At the first cafe the robot observes a waiting time of 4 mins, and updates its prior using Bayes theorem, giving it a posterior. However, when it moves to the second cafe, what should its prior be? It could just use the same prior as for the first cafe. However, this assumes implicitly that the cafes have the same average waiting time. So, the robot can do better by treating the cafes as a population, and using the distribution of waiting times across cafes as it’s prior expectation. So here, the prior is actually learned from the data. So the robot tracks the parameters of waiting times at each cafe, as well as at least two parameters to describe the population of cafes (mean and sd). When the robot observes new cafes, everything is updated, both the estimates for each cafe and the estimates for the population.

How much information is transferred across our units (like cafes) is dependent on their variation. If you constantly eat spicy chilli’s, you get less stomach bugs. However, they aren’t consistent, and the spiciness can vary a lot. Chillis are variable. So if you use your expectation from the whole population, harder to learn and update. Learning this variation is key.

Cause and Reconciliation

We’re fighting a battle on two fronts remember, where there is no unique solution but lots of good ones:

  1. Causal inference - don’t make causal salad
  2. Functional inference - estimation not trivial

So in multilevel model, we’re addressing the second one. trying to get more precise estimates. This is assuming our causal model is correct…

As a default we should use multilevel regression. Often little harm in doing this, and more often makes a big difference. Opt-in and opt-out of organ donor lists in Europe. Germany has high percentage of people who think people should donate their organs, but are opt-in, and so their actual organ donation consent percentage (legally) is extremely low. The default is important. Multilevel models are like opt-out systems. Usually estimate better and should have to justify not using them. This isn’t how things are done though, often assumed that we should start simple and work up.

The benefits of multilevel models

Dealing with clustering, the Russian dolls. Pseudoreplication. Benefits discussed in chapter one repeated here for clarity:

  1. Improved estimates for repeated sampling - When more than one observation arises from the same individual, location, or time, then single-level models either horribly underfit or overfit the data.
  2. Improved estimates for unbalanced sampling - When some of these observations are sampled more than others, multilevel models automatically deals with uncertainty across these different levels. This prevents oversampled clusters dominating our inference.
  3. Estimates of variation - If our research questions include variation among individuals or groups, multilevel models explicitly model variation.
  4. Avoid averaging, retain variation - Frequently, people average data over clusters to construct variables for analysis in fixed-effects models. Removing variation can be dangerous and our data transformation can be arbitrary. Multilevel models preserve this uncertainty and quantify the variation.

Examples from what we’ve looked at before - !Kung from families, species in clades, nations in continents, applicants in departments.

Going to learn how shrinkage and pooling work. Now we have to describe the distribution that we think the characteristics of the clusters follow. Maximum entropy helps us here. Also our estimation is trickier, which is aided by our MCMC algorithms with ulam. Also a bit trickier to understand because we make predictions at different (multi) levels of our data, making comparisons with WAIC etc. slightly more subtle. We have to make more descisions about the parameters we want to compare for example.

Example - Multilevel tadpoles

Reed frog tadpoles, quasi-experiment. Eggs suspended on leaves above buckets (ponds). Tadpoles fall in to the water below. Different densities and sizes and predation (presence) in each of the ponds. Interested in the number of surviving tadpoles. Focussing on variation across tanks.

Structure: Tadpoles in tanks at different densities Outcome: Number surviving (binomial)

We will ignore density and predation for now. We want to know if different tanks have different survival.

Two basic models: 1. Dummy variable for each tank (done this previously) and 2. Multilevel model with varying intercepts.

Our basic model would be like this

\[ \begin{aligned} S_i &\sim \operatorname{Binomial}(N_i, p_i) \\ \operatorname{logit}(p_i) &= \alpha_{\operatorname{TANK}[i]} \\ \alpha_j &\sim \operatorname{Normal}(0,1.5) \,\,\,\,\,\,\,\,\,\,\, \operatorname{for}j = 1..48 \\ \end{aligned} \] Where our survival \(S\) for each observation \(i\) is given by a binomial distribution with probability \(p\), and we estimate a different intercept for each tank \(j\) (or pond). Only the data from the tank at hand informs the \(\alpha\) from the tank. But surely we expect that survival is linked between.

Let implement this then. First lets look at the data

data(reedfrogs)
tadpole <- reedfrogs %>% 
   mutate(tank = 1:n())

# basic plot
ggplot(tadpole, aes(x = tank, y = surv, fill = surv)) +
   geom_col(show.legend = F) +
   labs(x = "Tank number", y = "Number survived") +
   scale_x_continuous(breaks = 1:48) +
   theme_bw(base_size = 14) +
   theme(panel.grid = element_blank())

# data as a list for our models
tadlist <- list(S = tadpole$surv,
                N = tadpole$density,
                tank = tadpole$tank)

We can see here already that there are different numbers of survivals, but very different numbers, probably owing to the density. But ignoring density for now, lets fit our fixed-effect only GLM.

m13.1 <- ulam(alist(S ~ dbinom(N,p),
                    logit(p) <- a[tank],
                    a[tank] ~ dnorm(0,1.5)),
              data = tadlist,
              chains = 4, cores = 4,
              log_lik = TRUE)
## Trying to compile a simple C file

But now we want to adaptively pool information across tanks. We do this with our multilevel model. What we are going to do is allow the intercepts to vary between tanks.

\[S_i \sim \operatorname{Binomial}(N_i, p_i)\] \[\operatorname{logit}(p_i) = \alpha_{\operatorname{TANK}[i]}\]

\[\alpha_j \sim \operatorname{Normal}(\bar{\alpha},\sigma)\] \[\bar{\alpha} \sim \operatorname{Normal}(0,1.5)\] \[\sigma \sim \operatorname{Exponential}(1)\] We allow the intercepts to vary by adding parameters into our intercepts distribution. This addition of parameters is the multilevel bit. Each \(\alpha_j\) has a prior with parameters. And then these parameters have priors themselves, so its just parameters within parameters, priors within priors. This is called a hyperprior. Survival across tanks has a distribution, this distribution is the prior for each tank, and this distribution needs its own prior.

Varying intercepts

Also called random intercepts. Random is silly though. So is varying. It is distinct because you learn the prior from the data. We regularise, but now we learn this from the data. The intercepts learn from each other and this pools across.

Lets do it! It’s really this easy

m13.2 <- ulam(alist(S ~ dbinom(N,p),
                    logit(p) <- a[tank],
                    a[tank] ~ dnorm(a_bar,sigma),
                    a_bar ~ dnorm(0,1.5), 
                    sigma ~ dexp(1)),
              data = tadlist,
              chains = 4, cores = 4,
              log_lik = TRUE)
## Trying to compile a simple C file

We get 50 parameters out of this model with precis (don’t look at it cos it’s big). Two that are \(\bar{a}\) and \(\sigma\) (the overall parameters for the intercept), and then 48 which are the per-tank intercepts. The fixed effects model would just have 48. Lets compare these models.

compare(m13.1, m13.2, func = WAIC)
##           WAIC       SE    dWAIC      dSE    pWAIC      weight
## m13.2 200.6119 7.209497  0.00000       NA 21.24809 0.998976812
## m13.1 214.3796 4.453243 13.76762 3.459656 25.49380 0.001023188

Our predictive performance for our varying intercepts model is better (not hugely - but definite increase). Also, note that for pWAIC it is lower for our multilevel model than it is for our fixed effects model. Why? Ended up with a stronger prior. Effective number is also lower than the literal number because we regularised our prior. For multilevel models, the fit to the sample can decrease but the out of sample fit is better - more regularising.

Shrinkage

Now lets compare the intercept differences between our fixed effect only model and our multilevel model. We just have to extract samples from the posterior and average this time. No explicit predictions needed. Just have the posterior survival probabilities for each of the tanks i.e. their intercept.

# extract the posterior and raw data
fixeff_post <- extract.samples(m13.1)$a 
reff_post <- extract.samples(m13.2)$a

# posterior survival summaries
tibble(tank = rep(1:nrow(tadpole), times = 2),
       mod = rep(c("fixed", "multilevel"), each = nrow(tadpole)),
       mean = inv_logit(c(colMeans(fixeff_post), colMeans(reff_post))),
       lwr = inv_logit(apply(cbind(fixeff_post,reff_post), 2, PI)[1,]),
       upr = inv_logit(apply(cbind(fixeff_post,reff_post), 2, PI)[2,])) %>% 
   ggplot(aes(x = tank, y = mean, colour = mod, shape = mod)) +
   geom_hline(yintercept = mean(inv_logit(extract.samples(m13.2)$a_bar))) +
   geom_hline(yintercept = mean(tadpole$surv/tadpole$density), colour = "red") +
   geom_errorbar(aes(ymax = upr, ymin = lwr), 
                 width = 0, alpha = 0.5, show.legend = FALSE) +
   geom_point(size = 3) +
   scale_x_continuous(breaks = 1:48) +
   scale_colour_manual(name = "", values = c("cornflowerblue","black")) +
   scale_shape_manual(name= "", values = 16:17) +
   labs(x = "Tank number", y = "Posterior survival probability") +
   theme_bw(base_size = 14) +
   theme(panel.grid = element_blank())

Here we have shrinkage i.e. we are not just fitting to the sample. Learning the regular feature of the sample. Because we have learned the prior from the data adaptively, and pooled information. First, raw mean and posterior mean \(\bar{\alpha}\) are different. The raw mean (red) is a lower survival probability. This is because we have imbalanced information across the tanks. The densities vary across tank. So here the \(\bar{\alpha}\) is higher, because we have accounted for this density differences.

The main shrinkage phenomenon is that the posterior means shrink towards the population level mean in he multilevel model. Because we have regularised, you see that for low survival tanks the mean shifts up towards the population average (black). As you move further from the mean the more shrinkage there is. This shrinkage decreases though as the density increases i.e. more data with less uncertainty. Shrinkage is the product of a lot of things in our model, but we can easily visualise it. We reduce our fit to sample to improve our predictive accuracy. This is similar to regression to the mean phenomenon that we always observe in regression, except here its because of the pooling and thus extreme values regress to the mean.

We can bring this back to poor Ulysses’s compass. Varying effects with partial pooling (some information shared - not all) are more accurate than fixed effects without pooling. The grand mean is maximally underfitting (i.e. if we ignored the tanks and heterogeneity) and also called complete pooling. The fixed effects maximally overfit (take each tank independently, but too little data to estimate each mean). These both lead to poor predictions. The varying effects do adaptive regularisation -> better predictions. We are navigating Sylla and Charybdis here adaptively.

If we now focus on \(\sigma\) i.e. the variation in our survivals. Minimum of 0 possible with dexp(1), but if it were to be actually set to 0, then there would be no variation between tanks and we converge on the complete pooling scenario. When sigma tends to infinity, we reach the no pooling scenario i.e. the fixed effects model i.e. that all the tanks are infinitely different from each other (completely independent). This is intuitive for us, the pooling that is. When we do experiments we consider this variation. The best \(\sigma\) is the regularised one that combines information of \(\alpha\) across tanks and the information in each tank. We learn the variation from the data itself.

Our inferred distribution

So what does our population look like? The difference with the previous models is that instead of just having loads of lines in our posterior (\(\alpha\)s and \(\beta\)s), we now how a full distribution of our population, captured by \(\bar{\alpha}\) and \(\sigma\), which have been generated adaptively incorporating the variation in our data. So we need both of these bits of information to build a picture of our population. We need correlated samples from our posterior. Drawing our distributions of distributions, correlated pairs of \(\bar{\alpha}\) and \(\sigma\).

post <- extract.samples(m13.2)

tad_abar <- post$a_bar[1:100]
tad_sigma <- post$sigma[1:100]

# Log odds iterating through with ggplot
p <- ggplot() + xlim(-3,4) +
   labs(x = "Log odds of survival", y = "Density") +
   theme_bw(base_size = 13) + theme(panel.grid = element_blank()) 

for(i in 1:100){
   p <- p + stat_function(fun = dnorm, alpha = 0.3,
                          args = list(mean = tad_abar[i], sd = tad_sigma[i]))
}

# Survival probability
sim_surv <- inv_logit(rnorm(8000, post$a_bar, post$sigma)) # remember we made our alpha normal
p_surv <- ggplot(data.frame(sim_surv), aes(x = sim_surv)) +
   geom_density(fill = "blue", alpha = 0.2) +
   scale_x_continuous(breaks = seq(0,1, by = 0.2),limits = c(0,1.111)) +
   labs(x = "Probability of survival", y = "Density") +
   theme_bw(base_size = 14) + theme(panel.grid = element_blank())

grid.arrange(p, p_surv, ncol = 2)

The accuracy advantage

Multilevel models are foundational to navigating the stormy seas and tackle the challenges introduced in Chapter 7. They are Ulysses’ magic compass. These shrinkage estimates are better. But here we can simulate the accuracy advantage here. We can simulate ponds here under our three assumptions of the future survival between our different ponds:

  1. Complete Pooling - one global intercept to predict all the ponds- Underfit
  2. No Pooling - each pond doesn’t tell us anything else about the others - Overfit
  3. Partial Pooling - Adaptively regularising prior - Multilevel model - Goldilocks

So lets simulate some data from these three scenarios based on the same example experiment from before.

# Model parameters
a_bar <- 1.5
sigma <- 1.5
nponds <- 60
Ni <- as.integer(rep(c(5,10,25,35), each = 15))
p_fullpool <- inv_logit(a_bar) ## the mean probability - assume this is the true value in a full pool

# Simulate pond-level intercepts (log odds of survival for each pond)
set.seed(5005)
a_pond <- rnorm(nponds, mean = a_bar, sd = sigma)

# Create data
tadsim <- data.frame(pond = 1:nponds, Ni = Ni, 
                     true_a = a_pond, 
                     true_p = inv_logit(a_pond),
                     p_fullpool = p_fullpool)

# Simulate survivors from our true intercepts
tadsim$Simsurv <- rbinom(nponds, prob = inv_logit(tadsim$true_a), 
                         size = tadsim$Ni)

# Compute no pooling survival estimates 
tadsim$p_nopool <- tadsim$Simsurv / tadsim$Ni
# Model partial pooling estimates
tadsim_list <- list(Si = tadsim$Simsurv, Ni = tadsim$Ni, pond = tadsim$pond)
m13.3 <- ulam(alist(Si ~ dbinom(Ni,p),
                    logit(p) <- a[pond],
                    a[pond] ~ dnorm(a_bar,sigma),
                    a_bar ~ dnorm(0,1.5), 
                    sigma ~ dexp(1)),
              data = tadsim_list,
              chains = 4, cores = 4,
              log_lik = TRUE)
## Trying to compile a simple C file
# Partial pooling intercepts - pond specific probabilities
tadsim_post <- extract.samples(m13.3)
tadsim$p_partpool <- inv_logit(colMeans(tadsim_post$a))

# Absolute probability error from three scenarios
tadsim_sum <- tadsim %>% 
   mutate(err_fullpool = abs(p_fullpool - true_p),
          err_nopool = abs(p_nopool - true_p),
          err_partpool = abs(p_partpool - true_p)) %>% 
   dplyr::select(pond, Ni, starts_with("err")) %>% 
   pivot_longer(starts_with("err"), names_prefix = "err_", 
                names_to = "Scenario", values_to = "Error")

tadsim_grsum <- tadsim_sum %>% 
   group_by(Ni, Scenario) %>% 
   summarise(mn = mean(Error))
## `summarise()` regrouping output by 'Ni' (override with `.groups` argument)
ggplot(tadsim_sum, aes(x = pond, y = Error, colour = Scenario)) +
   geom_hline(data = tadsim_grsum, aes(yintercept = mn, colour = Scenario)) +
   geom_point(size = 4, alpha = 0.6) +
   scale_x_continuous(breaks = 1:60) +
   facet_wrap(~Ni, ncol = 4, scales = "free_x") +
   labs(x = "Pond", y = "Absolute Error") +
   theme_bw(base_size = 14) +
   theme(strip.background = element_blank(),
         panel.grid = element_blank())

You can see that particularly at small sample sizes i.e. when there are differences in the density of data in our different groups, the absolute error for the partial pooling method with our regularising random effect is much reduced. Generally, partial pooling reduces the error relative to our true value of the survival probability.

In essence, using multilevel models that have these adaptive, pooling priors is the best way to go.

Lecture 16 - Multilevel models 2

Multiple clusters and divergent transitions

We can and in a lot of cases should have more than one type of cluster in our data for our random effects. Natural data is often pooled over both individuals and locations and time for example. We also need to investigate divergent transitions.

Lets return to prosocial chimpanzees. Left or right pull, with the presence of a partner and and which side or our food is. Answer was no with our binomial model. Here we have multiple clusters - here it is a cross classification model. We have actors and experimental block as our clusters. Block effects are a terror in science. Actors have the handedness things, and our blocks are different in time. Not nested here.

So what will our joint model look like here. Because we have hyperpriors for each of our intercept/offset terms (one for actor and one for block), the number of priors goes up a lot. The \(\beta\) doesn’t have an adaptive prior remember. However, generally it is just the same as before, but with these adaptive priors, and just priors all the way down.

\[ \begin{aligned} \text{LEFT}_{i} & \sim \operatorname{Binomial}\left(1, p_{i}\right) \\ \operatorname{logit}\left(p_{i}\right) &=\alpha_{\text {ACTOR }[i]}+\gamma_{\text {BLOCK }[i]}+\beta_{\text {TREATMENT }[i]} \\ \beta_{j} & \sim \text { Normal }(0,0.5), \text { for } j=1 . .4 \\ \alpha_{j} & \sim \operatorname{Normal}\left(\bar{\alpha}, \sigma_{\alpha}\right), \text { for } j=1 . .7 \\ \gamma_{j} & \sim \operatorname{Normal}\left(0, \sigma_{\gamma}\right), \text { for } j=1 . .6 \\ \bar{\alpha} & \sim \operatorname{Normal}(0,1.5) \\ \sigma_{\alpha} & \sim \text { Exponential }(1) \\ \sigma_{\gamma} & \sim \text { Exponential }(1) \end{aligned} \] Our \(j\)s are the sublevels of each of our variables. The \(\alpha\)s are the varying intercepts for our actors (Chimpanzee), which is adaptive i.e. a random effect. We also have our varying intercept for our Block, which is also a random effect. However, we don’t have a \(\bar{\gamma}\) here, and instead set the mean to 0. Adding this would be redundant, because then you’re using two terms to learn the intercept overall. This is still adaptive though, because we’re controlling the variance. Lets load.

data("chimpanzees")
chimpanzees$treatment <- 1 + chimpanzees$prosoc_left + 2*chimpanzees$condition
chimp_list <- list(pulled_left = chimpanzees$pulled_left,
                   actor = chimpanzees$actor,
                   block_id = chimpanzees$block,
                   treatment = as.integer(chimpanzees$treatment))

And lets build this model in ulam, good to put notes in saying what each of the sections of the model are.

set.seed(13)
m13.4 <- ulam(alist(
               # The model 
               pulled_left ~ dbinom(1, p),
               logit(p) <- a[actor] + g[block_id] + b[treatment],
   
               # Simple prior
               b[treatment] ~ dnorm(0,0.5),
               
               # Adaptive priors - our RE marker
               a[actor] ~ dnorm(a_bar, sigma_a),
               g[block_id] ~ dnorm(0, sigma_g),
               
               # Hyperpriors
               a_bar ~ dnorm(0,1.5),
               sigma_a ~ dexp(1),
               sigma_g ~ dexp(1)),
       data = chimp_list, chains = 4, cores = 4, log_lik = TRUE)
## Trying to compile a simple C file
## Warning: There were 2 divergent transitions after warmup. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## to find out why this is a problem and how to eliminate them.
## Warning: Examine the pairs() plot to diagnose sampling problems
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#tail-ess

Notice that we have a warning about Divergent Transitions - more on this in a moment. Let’s have a look at the output from this. This time we will look at the posterior distribution summary, and then the standard deviation terms for our two adaptive priors, \(\sigma_{\alpha}\) and \(\sigma_{\gamma}\) i.e. our random effect variances.

plot(precis(m13.4, depth = 2))

cpost <- extract.samples(m13.4)

bind_rows(tibble(term = "Block", post = cpost$sigma_g),
          tibble(term = "Actor", post = cpost$sigma_a)) %>% 
   ggplot(aes(x = post, fill = term)) +
   geom_density(alpha = 0.8) +
   scale_fill_manual(values = c("firebrick", "cornflowerblue"), 
                     name = "Varying\neffect") +
   labs(x = "Standard deviation", y = "Density") +
   theme_bw(base_size = 14) +
   theme(panel.grid = element_blank())

We can see here that the \(\gamma\) or g block effects are all very close to zero, and that the variance for our block effect is effectively 0, indicating that there isn’t much variance across blocks. Actor 2’s posterior is very wide, but all of these values are bunched up right at the boundary i.e. at 1, a[2] was a leftie.

If we then compare this to a random effects model without or block effects, the number of parameters drops by 7, but the effective number of parameters doesn’t drop as much as you would think. So it would seem that incorporating block doesn’t have benefits, but actually didn’t incur much of a cost either. The inference of the other effects remain similar in this case. Because of shrinkage. By aggressively regularising, we can have lots of parameters and still get similar inferences.

Even more clusters

A bad definition of random effects is that random effects are not fixed by the experimenter. However, the source of the variance doesn’t matter, we want the pooling for estimating benefits. If you care about better inference, use adaptive priors. Can use adaptive priors for the \(\beta\) terms too, works pretty well too. The estimates don’t change very much, we have lots of data for each treatment though.

Divergent Transitions

These are your friend. Tell you that something is numerically inefficient in the Markov Chain. Can fix them. You can sometimes fix this with changing a parameter, but sometimes need to reparameterize.

Imagine you’re on a frictionless rollercoaster. Two forms of energy that have to be conserved. If the rollercoaster moves down it looses potential energy and gains kenetic, and then when it goes up it is the other way around. Our Markov chain is like this. A divergent transition is when the rollercoaster cart comes of the track. The Hamiltonian method calculates the energy in the system. We have transitions which are our sample paths, the movement of the particle around the rollercoaster of the posterior. If the energy at the end of a transition is not the same as that at the start, then it is divergent. Indicates an inaccurate approximation. Tends to happen in regions of strong curvature, where we pop off the track. Happens a lot if parameters are conditional on other parameters too - Like in adaptive priors. Hamiltonian is good because it tells us about this - other algorithms don’t.

We can display this with the Devils Funnel, a two parameter posterior with a model like this

\[v \sim \text{Normal}(0,3)\] \[z \sim \text{Normal}(0, \exp v)\] Where for our two normal priors, the \(z\) depends on the \(v\). This creates a really steep curvature in the 2D parameter space for our posterior, and thus our transitions can become divergent, i.e. the rollercoaster can jump off the track. No single step size can help with this - need small step sizes when it is steep and bigger ones when it is shallow. Doesn’t corrupt your chain but can mean it is less efficient, and can mean your ineffective samples make a bad guess of the posterior.

Two ways to deal with it:

  1. Increase adapt-delta.
  2. Re-parametrise.

But first check if you have left a prior out or something like that. Surprisingly common.

Non-centered parameterisation

We can reparameterise by expressing our statistical model in another way, that is identical. Take this statistical model where we are investigating \(\alpha\), which is given by the following distribution

\[\alpha \sim \text{Normal}(\mu,\sigma)\] This is the centered parametrisation though, for which the distribution is dependent on more than one parameter. But we can rearrange this (identically) in a re-centered way

\[\alpha = \mu + \beta\] \[\beta \sim \text{Normal}(0,\sigma)\] They are the same, just have taken \(\mu\) out. Why do this though? Could go even further, and write this in a fully non-centered way, taking all parameter dependencies out.

\[\alpha = \mu + z\sigma\] \[z\sim \text{Normal}(0,1)\] Although these are the same mathematically, our HMC algorithm sees a different geometry, a different shape, which helps a lot with our sampling. All this because we don’t sample our parameters directly. Think of this like standardisation of our variables in the model. Why is it called centered? Just means that a distribution is conditional on one or more parameters. Fully non-centered means that we take all the conditioning out and put it in a linear model. In short we get many more effective samples from our Markov Chain if we do a non-centered parameterisation.

Reparameterising our Chimp model

So, with this information in mind we can repeat this for our chimp model, and do non-centered parameterisation so our adaptive priors are not dependent on other parameters. We do this by rewritting our linear model so that we have z-scores and sigmas inside of them. Each actor effect for example is some z-score for each chimp, multiplied by the variance of each chimp, and then our global intercept. This does not not look as intuitive, but is very useful. Our model will now look like this

\[ \begin{aligned} \text{LEFT}_{i} & \sim \operatorname{Binomial}\left(1, p_{i}\right) \\ \operatorname{logit}(p_{i}) &= \bar{\alpha} + z_{\text {ACTOR }[i]}\sigma_{\alpha} + x_{\text {BLOCK }[i]}\sigma_{\gamma} +\beta_{\text {TREATMENT }[i]} \\ \beta_{j} & \sim \text {Normal}(0,0.5), \text { for } j=1 . .4 \\ z_{j} & \sim \operatorname{Normal}(0,1), \text { for } j=1 . .7 \\ x_{j} & \sim \operatorname{Normal}(0,1), \text { for } j=1 . .6 \\ \bar{\alpha} & \sim \operatorname{Normal}(0,1.5) \\ \sigma_{\alpha} & \sim \text {Exponential}(1) \\ \sigma_{\gamma} & \sim \text {Exponential}(1) \end{aligned} \] The varying effects are now standardised (see the 0,1 normals for our two z scores - x is just chosen for notation but are both z scores). This is a very important feature of varying effects models that we need to consider.

And now for our Stan model. This will have posteriors that are the same, but there are some changes to our model code.

set.seed(13)
m13.4nc <- ulam(alist(
               # The model 
               pulled_left ~ dbinom(1, p),
               logit(p) <- a_bar + z[actor]*sigma_a + z[block_id]*sigma_g + b[treatment],
   
               # Simple prior
               b[treatment] ~ dnorm(0,0.5),
               
               # Adaptive priors - our RE marker
               z[actor] ~ dnorm(0, 1),
               x[block_id] ~ dnorm(0, 1),
               
               # Hyperpriors
               a_bar ~ dnorm(0,1.5),
               sigma_a ~ dexp(1),
               sigma_g ~ dexp(1),
               gq> vector[actor]:a <<- a_bar + z*sigma_a,
               gq> vector[block_id]:g <<- x*sigma_g),
       data = chimp_list, chains = 4, cores = 4, log_lik = TRUE)
## Trying to compile a simple C file

The linear model has changed such that our parameters go in there, but this is just a direct translation of the model equation above. Notice in particular our additional gp> parts of our model code, which is telling stan to do some additional calculations for us. Namely, it is saying that we want to combine our output from \(\bar{\alpha}\), \(z\), and \(\sigma_a\) to form our single parameter for our actor effect. The same goes for our block id effect.

precis(m13.4nc, depth = 2)
##                  mean         sd        5.5%        94.5%     n_eff     Rhat4
## b[1]    -0.1338145291 0.30042257 -0.60970642  0.344751657 1358.6031 1.0022816
## b[2]     0.3926771050 0.29955703 -0.06559330  0.879936696 1179.5050 1.0006211
## b[3]    -0.4738330686 0.30463067 -0.95219970  0.005767114 1280.4430 1.0005708
## b[4]     0.2640188005 0.29547635 -0.20556922  0.753040393 1303.7277 1.0017911
## z[1]    -0.5024181811 0.38870031 -1.12486331  0.084635614  634.6853 1.0013327
## z[2]     1.9552077605 0.59592505  1.03734569  2.940628577  916.2012 1.0021255
## z[3]    -0.6335768062 0.40721744 -1.28702001  0.011799817  651.0207 1.0012011
## z[4]    -0.6450548310 0.40200824 -1.26284727 -0.012323885  629.9960 1.0010765
## z[5]    -0.4892312073 0.38371001 -1.09373916  0.112101486  654.1643 1.0011031
## z[6]     0.0072164205 0.36425298 -0.56352293  0.570258050  678.5416 1.0006802
## z[7]     0.7910912884 0.43607727  0.12657706  1.499090848  790.4074 1.0016957
## x[1]    -0.0208604254 0.98845456 -1.64084473  1.550238386 2444.4280 0.9995703
## x[2]    -0.0097365017 1.00336108 -1.63362559  1.561128558 2461.8109 0.9991465
## x[3]    -0.0295082671 1.01796447 -1.63841435  1.568673546 2208.2098 1.0012846
## x[4]    -0.0125849982 0.94884083 -1.59434543  1.538319591 2429.2631 0.9999481
## x[5]    -0.0025402907 1.02624760 -1.64665161  1.626772554 2203.0833 1.0008895
## x[6]     0.0153919448 1.02915995 -1.61915783  1.673286594 2403.8510 1.0014470
## a_bar    0.5935843240 0.79159075 -0.64046771  1.805764371  661.2881 1.0018426
## sigma_a  2.0926655475 0.65296715  1.27044661  3.246483928  743.8465 1.0038288
## sigma_g  0.1209204878 0.09270074  0.01118936  0.288943401 1594.7836 0.9992898
## g[1]    -0.0052937695 0.15389780 -0.23580359  0.214326570 1879.4365 0.9996622
## g[2]     0.0035210283 0.15193126 -0.20272238  0.239008811 1731.8781 0.9989091
## g[3]    -0.0040019575 0.15910361 -0.24760750  0.224456456 1828.8817 1.0006642
## g[4]    -0.0042513885 0.14688421 -0.23488518  0.209936913 2036.0547 0.9990703
## g[5]     0.0001402301 0.15781407 -0.24943279  0.243228270 1878.7939 1.0005407
## g[6]     0.0034896712 0.15742661 -0.23500020  0.245107726 1966.3631 1.0034609
## a[1]    -0.3690261764 0.34936510 -0.93654040  0.188467635 1302.3093 1.0025622
## a[2]     4.5140995547 1.18648167  2.99717095  6.474990192 1872.7041 1.0001300
## a[3]    -0.6185312142 0.35100466 -1.17998326 -0.060906518 1347.6455 1.0032867
## a[4]    -0.6437962993 0.35800560 -1.19979848 -0.071724270 1417.2506 1.0011787
## a[5]    -0.3423680281 0.33565201 -0.88306670  0.221234328 1479.7219 1.0027432
## a[6]     0.6120801013 0.35980189  0.04772565  1.176283558 1373.5694 1.0028015
## a[7]     2.1259092378 0.45226710  1.44288889  2.855561457 1648.3085 0.9995548

The benefits of this re-parameterisation:

  1. The chain runs faster.
  2. Large increase to effective sample number.
  3. No divergent transitions.

Posterior predictions

Now we want to do some predictions using our multilevel models. The first thing we have to note is that because we have done partial pooling, and have shrinkage towards a population level mean, our predictions may not fit our sample as well as before. This will particularly be the case for clusters with a small sample size with a low density of data.

For actually doing the predictions themselves, the thing that the addition of random effects does is that it makes your predictions a little bit more subtle. You have to make some choices i.e. deciding what level of the model you are interested in reporting.

Are you interested in new clusters, or the same clusters?

So, we have learned about some parameters both at the population level and at the cluster level. If you’re interested in the same clusters that you have used, works very much like before i.e. with sim and link. Treat your varying effects as parameters and just push samples back through the model. This is if you want to predict for your sample essentially. Here you take into account your adaptive priors. For the chimpanzees, think of this where we predict for the chimpanzees data itself.

However, we may also want to make predictions for new clusters, for which our additional information from the adaptive priors may not be as useful and instead we’re just interested in the population-level mean. For the chimps, we would want to sample from new actors. And now Richard shows us three we have to decide whether we want to use:

  1. The average cluster - population mean
  2. The marginal of cluster - sample a bunch of individuals in our cluster and average over them incl. variation
  3. Sample of actors from posterior - Show your posterior of different individuals in the cluster as new lines.

Average actor

Here we want to take the effect at the population mean or at \(\bar{\alpha}\) in our chimpanzee example. Here, we just want to do a prediction that incorporates our population-level intercept and our treatment \(\beta\) effects. So this is useful for looking at population-level treatment effects, things that are captured in our \(\beta\) terms. I.e. this is often the average one you’ll want for papers.

# posterior samples
post_chimp <- extract.samples(m13.4nc)

# function to combine specific parts of the posterior (only treatment and a_bar here)
p_link_abar <- function(treatment){
   logodds = with(post_chimp, a_bar + b[,treatment])
   return(inv_logit(logodds))
}

# predict for our four treatments
pdat <- tibble(lab = c("Right prosocial\nNo partner", "Left prosocial\nNo partner", 
               "Right prosocial\nPartner", "Left prosocial\nPartner"),
       treatment = 1:4, mu = NA, lwr = NA, upr = NA) 

for(i in 1:nrow(pdat)){
   preds = p_link_abar(pdat$treatment[i])
   pdat$mu[i] = mean(preds)
   pdat$lwr[i] = PI(preds)[1]
   pdat$upr[i] = PI(preds)[2]
}

# plot
ggplot(pdat, aes(x = factor(lab, levels = lab), y = mu)) +
   geom_errorbar(aes(ymax = upr, ymin = lwr), width = 0.03) +
   geom_point(size = 3) +
   lims(y= c(0,1)) +
   labs(x = "", y = "Posterior proportion pulled left") +
   theme_bw(base_size = 14)

Marginal of actor - sample actors

Now we want to sample a bunch of different chimpanzees and average over them, to incorporate the variation in the cluster as well as just the mean. We can then either just plot these different simulated actors, or summarise over them to get our marginalised effect, which includes more variation.

# Simulate different intercepts i.e. individuals in our cluster from the model parameters a_bar and sigma_a
a_sim <- with(post_chimp, rnorm(length(post_chimp$a_bar), a_bar, sigma_a))

# function to combine specific parts of the posterior (a_sim and treatment)
p_link_asim <- function(treatment){
   logodds = with(post_chimp, a_sim + b[,treatment])
   return(inv_logit(logodds))
}


# 1. Marginal summary
# predict for our four treatments
pdat <- tibble(lab = c("Right prosocial\nNo partner", "Left prosocial\nNo partner", 
               "Right prosocial\nPartner", "Left prosocial\nPartner"),
       treatment = 1:4, mu = NA, lwr = NA, upr = NA) 

for(i in 1:nrow(pdat)){
   preds = p_link_asim(pdat$treatment[i])
   pdat$mu[i] = mean(preds)
   pdat$lwr[i] = PI(preds)[1]
   pdat$upr[i] = PI(preds)[2]
}

# plot
marg1 <- ggplot(pdat, aes(x = factor(lab, levels = lab), y = mu)) +
   geom_errorbar(aes(ymax = upr, ymin = lwr), width = 0.03) +
   geom_point(size = 3) +
   lims(y= c(0,1)) +
   labs(x = "", y = "Posterior proportion pulled left") +
   theme_bw(base_size = 14)

# Marginal actors
marg2 <- data.frame(sim = 1:length(post_chimp$a_bar),
              sapply(1:4, function(i) p_link_asim(i))) %>% 
   pivot_longer(-sim, names_prefix = "X") %>% 
   mutate(name = as.integer(name),
           lab = rep(c("Right prosocial\nNo partner", "Left prosocial\nNo partner", 
               "Right prosocial\nPartner", "Left prosocial\nPartner"),
               times = length(post_chimp$a_bar))) %>% 
   filter(sim <= 200) %>% 
   ggplot(aes(x = lab, y = value, group = sim)) +
   geom_line(alpha = 0.2, size = 1.3) +
   lims(y= c(0,1)) +
   labs(x = "", y = "Posterior proportion pulled left") +
   theme_bw(base_size = 14)

grid.arrange(marg1, marg2, ncol = 2)

Post-stratification

One final thing to mention is if we have non-representative samples of differing densities. We may want to account for this in our predictions even more. To do this, we can do post-stratification, where by we weight our estimates for each of our differing individuals (groups in our cluster) based on their sample size. This doesn’t always work, particularly if there is causal links between non-representative samples and our outcome.

Lecture 17 - Adventures in Covariance

Extending multilevel models

So far we have just been varying the intercepts i.e. the averages are differing by cluster. Pooling improves our estimates. Even though we bias our estimate, it is less biased actually i.e. better predictions out of sample. Any parameter in our model can be turned into a varying effect.

However, we can extend this strategy to our slopes i.e. that the effect of our predictor varies by cluster. Again, you know this already but there are lots of good reasons why we might want to allow the slope to vary in a multilevel model. Do drug levels affect different people differently, after-school programmes don’t work for everyone, different life-history trajectories for different individuals. Average effect misleading? Can be depending on the clusters - not every unit has the same response.

Could treat them the same as the intercepts with adaptive pooling priors. Just shove another adaptive prior in our linear model equation. However we can do even better though by relating our intercepts to our slope. They have covariance with each other. Learning one gives you some info about the other, as you change the slope of the line the intercept changes (e.g. higher intercepts have smaller slopes). We want to have a single prior, which brings together features of the different units in our population that has correlation between these features i.e. the intercepts and the slopes. So we want to do correlated population pooling. This correlation structure lets us transfer information across these features.

Cafe Robot

Lets demonstrate this with our cafe robot example again. Coffee waiting times between different cities again. Pool the waiting times across our clusters of different cities. However, now we want to know whether the coffee waiting times are different depending on the time of day as well. Our intercept is our average waiting time in the morning, and our slope is the change in waiting time from morning to afternoon. Will our slopes and intercepts be related? Yes. There is a negative correlation between average weight time in the morning (intercept) and our difference in morning and afternoon (slope). If you have higher morning waiting time, your slope (diff) is smaller because there isn’t a saturation effect. This correlation gives us information, which we will do well by using.

The 2-dimensional Gaussian

In previous priors, we have been using independent Gaussian distributions. However, even in our quadratic approximation examples (which feel like a long time ago) we have also been using a multivariate Gaussian distribution to capture our variance-covariance matrix. This multivariate Gaussian distribution would capture how each parameter’s posterior probability relates to others in our model. The goal here is to extend this i.e. explicitly add more info to this matrix, when we specify the model itself to capture different types of varying effect.

Now we are going to assign a 2-dimensional Gaussian distribution for both the intercepts and slopes. This works in the same way as \(\text{Normal}(\mu,\sigma)\) where we need a mean and standard deviation for our distribution. Here, we do the same thing except combine info from our slope and intercept and make this multidimensional by using:

  1. A vector of means
  2. The variance-covariance matrix

Our vector of means is just our average intercept and average slope. The variance-covariance matrix is just the multidimensional analog of \(\sigma\) i.e. a matrix where we have the spread for all of our different parameters, and takes the following form

\[ \begin{pmatrix} \sigma^{2}_{\alpha} & \sigma_{\alpha}\sigma_{\beta}\rho \\ \sigma_{\alpha}\sigma_{\beta}\rho & \sigma^{2}_{\beta} \end{pmatrix}. \]

Here, the \(\sigma\)s indicate the variance of our intercept \(\alpha\) and slope \(\beta\) terms, respectively. The \(\rho\) gives the correlation between the intercept and slope. Thus, the diagonal values are the variances in our slope and intercept, and our off-diagonals are their covariances, which is their standard deviations multiplied together and multiplied by their correlation. In general following this, the covariance of two variables \(x\) and \(y\) is defined as

\[\text {cov}(xy) = \sigma_x\sigma_y\rho_{xy}\]

The covariance of two variables more generally is specifically defined as the difference in the average product and the product of the averages. And the variance is just a special case of this where the two variables are the same thing. Based on that, we can combine this information to define the covariance of our two variables, by combining with our additional \(\rho\) which ensures that covariance is higher when our two variables are similar. This variance-covariance matrix will of course get bigger as we have more parameters that interact with each other.

Matrix multiplication

Matrix comes from the word womb/mother in latin. They are just for notation. Remember your matrix multiplication. Nice easy way to remember this from Richard - For the dimension position (e.g. first row first column) of your resulting matrix, you multiply across the row of the first matrix and down the column of the second matrix.

\[ \begin{pmatrix} A & B \\ C & D \end{pmatrix} \times \begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} Aa + Bc, & Ab + Bd \\ Ca + Dc, & Cb + Dd \end{pmatrix} \]

Cafe robot simulation

Lets simulate our coffee robots results to see how this would work.

## 1. Simulate our population of cafes
# 1a. Average parameters
a       <- 3.5    # average morning wait time
b       <- (-1)   # average difference between morning and afternoon
sigma_a <- 1      # std dev in intercepts
sigma_b <- 0.5    # std dev in slopes
rho     <- (-0.7) # correlation between slopes and intercepts
N_cafes <- 20     # we simulate 20 cafes

# 1b. Our multivariate normal of effects
Mu     <- c(a,b)               # the vector of means
cov_ab <- sigma_a*sigma_b*rho  # the covariance of our parameters
Sigma  <- matrix(c(sigma_a^2, cov_ab, cov_ab, sigma_b^2),
                 ncol = 2)     # the variance-covariance matrix

# 1c. More useful way to build vcov matrix - needed later
sigmas <- c(sigma_a, sigma_b)                  # our std devs
Rho    <- matrix(c(1,rho,rho,1), nrow = 2)     # our correlation matrix
Sigma <- diag(sigmas) %*% Rho %*% diag(sigmas) # matrix multiplication of our separate components

# 1d. Simulate the properties of 20 cafes from the multivariate normal
set.seed(5)
vary_effects <- mvrnorm(N_cafes, Mu, Sigma) # gives us the intercepts and slopes, respectively
a_cafe       <- vary_effects[,1] 
b_cafe       <- vary_effects[,2] 

# 1e. Visualise our cafes
plot(a_cafe, b_cafe, col= rangi2, 
     xlab = "Waiting time (intercept)",
     ylab = "Difference between morning and afternoon (slope)")
for(l in c(0.1, 0.3, 0.5, 0.8, 0.99)){
   lines(ellipse(Sigma, centre = Mu, level = l), col = col.alpha("black"))
}

## 2. Simulate observations of different waiting times from each cafe
set.seed(22)
N_visits  <- 10 
afternoon <- rep(0:1, N_visits*N_cafes/2)
cafe_id   <- rep(1:N_cafes, each = N_visits)
mu        <- a_cafe[cafe_id] + b_cafe[cafe_id]*afternoon # mean of wait time
sigma_cf  <- 0.5                                         # std within cafe
wait      <- rnorm(N_visits*N_cafes, mu, sigma_cf)       # simulated waiting times
cafdat    <- data.frame(cafe_id = cafe_id, afternoon = afternoon, wait = wait) 

# 2a. look at the simulated data - colours are our different cafes
ggplot(cafdat, aes(x = factor(afternoon), y = wait, 
                   colour = factor(cafe_id))) +
   geom_point(show.legend = F, size = 4, alpha = 0.6) +
   labs(x = "Afternoon?", y = "Waiting time (mins)") +
   theme_bw(base_size = 13) + theme(panel.grid = element_blank())

The varying slopes model

With this simulated data we can now build our rather big looking varying slopes model.

\[ \begin{aligned} \text{WAIT}_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \alpha_{\text{CAFE}[i]} + \beta_{\text{CAFE}[i]}\text{AFTERNOON}_i \\ \begin{bmatrix} \alpha_{\text{CAFE}} \\ \beta_{\text{CAFE}} \end{bmatrix} &\sim \text{MVNormal}\left(\begin{bmatrix} \alpha \\ \beta \end{bmatrix}, \textbf{S}\right) \\ \textbf{S} &= \begin{pmatrix} \sigma_{\alpha} & 0 \\ 0 & \sigma_{\beta} \end{pmatrix} \textbf{R} \begin{pmatrix} \sigma_{\alpha} & 0 \\ 0 & \sigma_{\beta} \end{pmatrix} \\ \alpha &\sim \text{Normal}(5,2) \\ \beta &\sim \text{Normal}(-1,0.5) \\ \sigma &\sim \text{Exponential}(1) \\ \sigma_{\alpha} &\sim \text{Exponential}(1) \\ \sigma_{\beta} &\sim \text{Exponential}(1) \\ \textbf{R} &\sim \text{LKJcorr}(2) \\ \end{aligned} \] Lets go through this beast. Outcome is the wait time which is normally distributed. For this we have our linear model on the mean, with varying intercepts and slopes for each cafe. The slope is the difference between morning and afternoon. Then we have a multivariate prior, this is our two dimensional adaptive prior. For each cafe there are two parameters, and they are distributed with a multivariate normal with a mean of the vector of means (average slope and intercept), and the variance-covariance matrix S. Then we have four fixed or non-adaptive priors here, which give the \(\alpha\), \(\beta\) and \(\sigma\)s.

LKJ prior

Finally we have a new prior, which is our correlation matrix prior. These are trickier because the values constrain each other inside the matrix. Can’t just pick arbitrary correlations. So, we use the LKJ family of priors. Lewandowski, Kurowicka and Joe. They call it the onion method. It has one shape parameter \(\eta\), which describes the concentration or dispersion from the identity matrix (zero correlations everywhere and 1s in the diagonal). If it’s really big there is little correlation, if it’s smaller then the correlations are flatter.

Now we can fit the model.

m14.1 <- ulam(alist(wait ~ normal(mu, sigma),
                   mu <- a_cafe[cafe_id] + b_cafe[cafe_id]*afternoon,
                   c(a_cafe, b_cafe)[cafe_id] ~ multi_normal(c(a,b), Rho, sigma_cafe),
                   a ~ normal(5,2),
                   b ~ normal(-1,0.5),
                   sigma_cafe ~ exponential(1),
                   sigma ~ exponential(1),
                   Rho ~ lkj_corr(2)),
              data = as.list(cafdat), chains = 4, cores = 4)
## Trying to compile a simple C file

Lets have a look at the posterior for our correlation matrix \(\textbf{R}\).

post_cafe <- extract.samples(m14.1)
dens(post_cafe$Rho[,1,2])

Here we have looked at the value of Rho in the first row and the second column, which is the correlation between the intercept and the slope (which we set at -0.7 in the simulation). This is the same value as row 2, column1. Note that here we dealing with matrix slices, so there are three positions i.e. post$R[x,y,z] or [sample,row,column]. This is a distribution of matrixes.

Now we get posterior shrinkage, but in two dimensions, for both the slope and the intercept. We’ll compute the raw unpooled estimates from the data directly, and extract the cafe-specific posterior values for them too:

## Unpooled estimates, just the means of the raw data vectors
a1 <- sapply(1:N_cafes, function(i) mean(wait[cafe_id == i & afternoon == 0]))
b1 <- sapply(1:N_cafes, function(i) mean(wait[cafe_id == i & afternoon == 1])) - a1

## Posterior samples
a2 <- apply(post_cafe$a_cafe, 2, mean)
b2 <- apply(post_cafe$b_cafe, 2, mean)

## Have to compute the inferred posterior model parameters for our posterior contours
Mu_est <- c(mean(post_cafe$a), mean(post_cafe$b))
rho_est <- mean(post_cafe$Rho[,1,2])
sa_est <- mean(post_cafe$sigma_cafe[,1])
sb_est <- mean(post_cafe$sigma_cafe[,2])
cov_ab <- sa_est*sb_est*rho_est
Sigma_est <- matrix(c(sa_est^2, cov_ab,cov_ab, sb_est^2),
                    ncol = 2)

## Shrinkage plot
plot(a1, b1, xlab = "Intercept", ylab = "Slope", pch =16, col = rangi2, 
     ylim = c(min(b1)-0.1, max(b1) + 0.1), xlim = c(min(a1)-0.1, max(a1) + 0.1))
points(a2,b2, pch = 1)
for(i in 1:N_cafes) lines(c(a1[i],a2[i]), c(b1[i], b2[i]))
for(l in c(0.1,0.3,0.5,0.8,0.99))
   lines(ellipse(Sigma_est, centre = Mu_est, level = l, 
                 col = col.alpha("black", 0.2),))

We get this cool shrinkage towards the centre of the inferred mean of the posterior for our slope and intercept. The raw data in blue is shrunk in our posterior estimates towards the mean. You can also plot the wait times on the outcome scale, giving the morning vs. afternoon wait time for coffee, by computing the equation for your \(\mu\). This is multi-dimensional shrinkage, which pools information across intercepts and slopes. It depends on the correlations between our effects.

Lecture 18 - Adventures in Covariance 2

More covariance structures

Many effects, many clusters

So now we have done a simulated example of the varying slopes model, lets apply this to some real data i.e. lets return to the left-pulling example in the chimps. Now we want to have a varying slope model, where our effect of treatment is able to vary by our actors, as well as by block (which isn’t really justified but hey ho). Our linear model section looks like this now:

\[ \begin{aligned} \text{LEFT}_i &\sim \text{Binomial}(1, p_i) \\ \text{logit}(p_i) &= \gamma_{\text{TREATMENT}[i]} + \alpha_{\text{ACTOR}[i],\text{TREATMENT}[i]} + \beta_{\text{BLOCK}[i],\text{TREATMENT}[i]} \end{aligned} \] We have our average treatment effects given by \(\gamma\), and then two varying effects where those treatment effects depend on our actor or block. So for our \(\alpha\) term for example, this is a parameter for the deviation of actor from the mean effect of treatment in a particular treatment. Each actor gets four parameters for \(\alpha\). So there are 76 parameters in total.

Our adaptive priors this time are 4-dimensional, because each of our \(j\) actors and blocks can vary in the way that treatment effects the probability of a left pull. We want a vector of four treatment effects for each actor. So these priors look like this

\[ \begin{aligned} \begin{bmatrix} \alpha_{j,1} \\ \alpha_{j,3} \\ \alpha_{j,3} \\ \alpha_{j,4} \end{bmatrix} &\sim \text{MVNormal} \left(\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \textbf{S}_{\text{ACTOR}} \right) \\ \begin{bmatrix} \beta_{j,1} \\ \beta_{j,2} \\ \beta_{j,3} \\ \beta_{j,4} \end{bmatrix} &\sim \text{MVNormal} \left( \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \textbf{S}_{\text{BLOCK}} \right) \end{aligned} \]

Our means for the multi-dimensional Gaussian priors are 0 because we already have our mean effects of the treatments kept in our \(\gamma\) term.

We could now fit this with our standard ulam procedure, specifying our multi-dimensional Gaussian priors and covariance matrices that vary by actor for each treatment. This would look like this

m14.1 <- ulam(alist(
                   # The linear model part
                   pulled_left ~ dbinom(1, p),
                   logit(p) <- g[treatment] + alpha[actor,treatment] + beta[block_id, treatment],
                   
                   # The adaptive multi-dimensional priors
                   vector[4]:alpha[actor] ~ multi_normal(0, Rho_actor, sigma_actor),
                   vector[4]:beta[block_id] ~ multi_normal(0, Rho_block, sigma_block),
                   
                   # The fixed priors
                   g[treatment] ~ dnorm(0,1),
                   Rho_actor ~ dljkcorr(4),
                   sigma_actor ~ dexp(1),
                   Rho_block ~ dljkcorr(4),
                   sigma_block ~ dexp(1)),
              data = chimp_list, chains = 4, cores = 4, log_lik = TRUE)

Non-centered parameterisation

However, guess what, there are divergent transitions here, because we have got parameters within our multi-dimensional adaptive priors. We want to reparameterise and factor our these dependencies like before to improve our HMC output. Recall that this is for efficiency purposes, not for inference - given enough time it should be the same estimates. However, to do this for a multi-dimensional prior is much more tricky. We have to use something called Cholesky factors which enable us to factor our parameters out. This was a guy who invented this trick, was an artillery officer in WW1, but he had made notes on the battlefield about systems linear equations. Remember we want to get z scores for our factoring out our parameters. Cholesky figured out a way to maintain a correlation structure between two independent sets of z scores with a particular transformation using the correlation you desire between them. Don’t need to know the mechanics of it, just that it works for our matrices. So lets do it

m14.3 <- ulam(alist(
                   # The linear model part
                   pulled_left ~ dbinom(1, p),
                   logit(p) <- gamma[treatment] + alpha[actor,treatment] + beta[block_id, treatment],
                   
                   # The adaptive multi-dimensional priors - non-centered
                   transpars> matrix[actor,4]:alpha <- 
                      compose_noncentered(sigma_actor, L_Rho_actor, z_actor),
                   transpars> matrix[block_id,4]:beta <- 
                      compose_noncentered(sigma_block, L_Rho_block, z_block),
                   matrix[4,actor]:z_actor ~ dnorm(0,1),
                   matrix[4,block_id]:z_block ~ dnorm(0,1),
                   
                   # The fixed priors
                   gamma[treatment] ~ dnorm(0,1),
                   vector[4]:sigma_actor ~ dexp(1),
                   cholesky_factor_corr[4]:L_Rho_actor ~ lkj_corr_cholesky(2),
                   vector[4]:sigma_block ~ dexp(1),
                   cholesky_factor_corr[4]:L_Rho_block ~ lkj_corr_cholesky(2),
                   
                   # convert back to ordinary correlation matrices from the cholesky ones
                   gq> matrix[4,4]:Rho_actor <<- Chol_to_Corr(L_Rho_actor),
                   gq> matrix[4,4]:Rho_block <<- Chol_to_Corr(L_Rho_block)
                   ),
              data = chimp_list, chains = 4, cores = 4, log_lik = TRUE)
## Trying to compile a simple C file

You can see we didn’t have any divergent transitions and that we have a really healthy number of effective samples. The full precis, which now goes to a depth of 3 because of our matrixes has a lot of samples, has all of our 76 parameters: 4 average treatment effects \(\gamma\), 4x7 varying effects on actor, our \(\alpha\) terms, 4x6 varying effects on block, our \(\beta\) terms, 8 standard deviations \(\sigma\) and then 12 free correlation parameters from R (remember the two halves of our matrix are identical, and the diagonals are just 1).

If we inspect the regularising variance priors, our sigma values, we can see that especially for block the regularisation has been really strong, with little variance in treatments across our blocks. This indicates that the shrinkage is pretty big.

Now lets pull out our correlation matrix Rho, which tells us the correlation between treatments across our actors i.e. will one chimp just pull left for all the different treatments because of their individual variation i.e. handedness, or does it vary based on the treatment.

## Actor
actor_rho_plot <- tibble(par = rownames(precis(m14.3, pars = "Rho_actor", depth = 3)),
                         rho = precis(m14.3, pars = "Rho_actor", depth = 3)$mean) %>% 
   mutate(treatment1 = rep(c("Right prosocial\nNo partner", "Left prosocial\nNo partner", 
               "Right prosocial\nPartner", "Left prosocial\nPartner"), each = 4),
          treatment2 = rep(c("Right prosocial\nNo partner", "Left prosocial\nNo partner", 
               "Right prosocial\nPartner", "Left prosocial\nPartner"), times = 4)) %>% 
   ggplot(aes(x = treatment1, y = treatment2)) +
   geom_tile(aes(fill = rho), show.legend = F) +
   geom_text(aes(label = round(rho,2))) +
   scale_fill_viridis_c(option = "B",begin = 0.1, end = 0.8,limits = c(-1,1)) +
   scale_x_discrete(expand = c(0,0)) +
   scale_y_discrete(expand = c(0,0)) +
   labs(x = NULL, y = NULL, 
        title = "Actors") +
   theme_bw(base_size = 14)

## Block
block_rho_plot <- tibble(par = rownames(precis(m14.3, pars = "Rho_block", depth = 3)),
                         rho = precis(m14.3, pars = "Rho_block", depth = 3)$mean) %>% 
   mutate(treatment1 = rep(c("Right prosocial\nNo partner", "Left prosocial\nNo partner", 
               "Right prosocial\nPartner", "Left prosocial\nPartner"), each = 4),
          treatment2 = rep(c("Right prosocial\nNo partner", "Left prosocial\nNo partner", 
               "Right prosocial\nPartner", "Left prosocial\nPartner"), times = 4)) %>% 
   ggplot(aes(x = treatment1, y = treatment2)) +
   geom_tile(aes(fill = rho)) +
   geom_text(aes(label = round(rho,2))) +
   scale_fill_viridis_c(option = "B",begin = 0.1, end = 0.8,limits = c(-1,1)) +
   scale_x_discrete(expand = c(0,0)) +
   scale_y_discrete(expand = c(0,0)) +
   labs(x = NULL, y = NULL, 
        title = "Block") +
   theme_bw(base_size = 14)

grid.arrange(actor_rho_plot, block_rho_plot, ncol = 2, widths = c(5,6))

You can see very clearly that our actors were very correlated with themselves across our 4 treatments, where as our blocks were not correlated across treatments (good thing). This suggests that the chimps are very individual, and points at the handedness effects that we have seen throughout the course.

Now finally we will do some posterior predictions for this and see how our model fits the raw data again. We’re going to do predictions kind of ignoring block by setting it at 5 (doesn’t matter as the Rhos were all 0 more or less). We’re going to do retrodictive prediction i.e. fitting to the same clusters that we observed so using link.

pred_dat <- expand.grid(actor = 1:7, treatment = 1:4, block_id = 5)

pred <- link(m14.3,data = pred_dat)
p_mu <- apply(pred, 2, mean)
p_ci <- apply(pred, 2, PI)

# raw data in right format
rawpoints <- chimpanzees %>% 
   group_by(actor, treatment) %>% 
   summarise(prop = mean(pulled_left)) %>% 
   ungroup() %>% 
   mutate(chimp = paste0("Chimp ", actor),
          prosocial = if_else(treatment %in% c(1,3) == TRUE, "right", "left"),
          partner = if_else(treatment %in% 1:2 == TRUE, "absent", "present"))
## `summarise()` regrouping output by 'actor' (override with `.groups` argument)
# The plot
pred_dat %>% 
   mutate(mu = p_mu, lwr = p_ci[1,], upr = p_ci[2,],
          chimp = paste0("Chimp ", actor),
          prosocial = if_else(treatment %in% c(1,3) == TRUE, "right", "left"),
          partner = if_else(treatment %in% 1:2 == TRUE, "absent", "present")) %>% 
   ggplot(aes(x = partner, y = mu, group = interaction(chimp, prosocial), shape = prosocial, colour = factor(chimp))) +
   geom_point(aes(colour = NULL, y = prop), data = rawpoints, size = 4) +
   geom_hline(yintercept = 0.5, linetype = "dashed") +
   geom_errorbar(aes(ymin = lwr, ymax = upr), width = 0, show.legend = F) +
   geom_point(size = 4) + geom_line(show.legend = F) +
   guides(colour = "none", shape = guide_legend("Prosocial\nchoice")) +
   labs(x = "Partner", y = "Mean posterior probability") +
   facet_grid(~chimp) +
   theme_bw(base_size = 13) +
   theme(panel.grid = element_blank())

You can see we do a much better job of fitting the individual-level variation in pulling left in this sample.

Multilevel horoscopes